Abstract

In this letter we borrow from the inference techniques developed for unbounded
state-cardinality (nonparametric) variants of the HMM and use them to develop a
tuning-parameter free, black-box inference procedure for explicit-state-duration
hidden Markov models (EDHMMs). EDHMMs are HMMs that have latent states
consisting of both discrete state-indicator and discrete state-duration random
variables. In contrast to the implicit geometric state duration distribution
possessed by the standard HMM, EDHMMs allow the direct parameterisation and
estimation of per-state duration distributions. As most duration distributions
are defined over the positive integers, truncation or other approximations are
usually required to perform EDHMM inference.


1 Introduction

Hidden Markov models (HMMs) are a fundamental tool for data analysis and
exploration. Many variants of the basic HMM have been developed in response
to shortcomings in the original HMM formulation [9]. In this paper we address
inference in the explicit state duration HMM (EDHMM). By state duration we
mean the amount of time an HMM dwells in a state. In the standard HMM
specification, a state's duration is implicit and, a priori, distributed
geometrically.
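
To make the implicit geometric duration concrete, the following minimal sketch computes the dwell-time distribution implied by a self-transition probability in a standard HMM. The function names and the value 0.9 are illustrative assumptions, not taken from the paper.

```python
def geometric_dwell_pmf(a_ii, d):
    """P(dwell time = d) for a standard HMM state with self-transition prob a_ii."""
    if d < 1:
        return 0.0
    # The chain stays d - 1 times and then leaves once.
    return (a_ii ** (d - 1)) * (1.0 - a_ii)


def expected_dwell(a_ii):
    """Mean of the geometric dwell time, 1 / (1 - a_ii)."""
    return 1.0 / (1.0 - a_ii)


if __name__ == "__main__":
    a_ii = 0.9  # hypothetical self-transition probability
    print([geometric_dwell_pmf(a_ii, d) for d in range(1, 6)])
    print(expected_dwell(a_ii))
```

Whatever the value of the self-transition probability, the most probable dwell time is a single step and the probabilities decay monotonically, which is the restriction the EDHMM removes.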

The EDHMM (or, equivalently, the hidden semi-Markov model [12]) was
developed to allow explicit parameterization and direct inference of state
duration distributions. EDHMM estimation and inference can be performed using
the forward-backward algorithm, though only if the sequence is short or a tight
"allowable" duration interval for each state is hard-coded a priori [13]. If the
sequence is short then forward-backward can be run on a state representation
that allows for all possible durations up to the observed sequence length. If the
sequence is long then forward-backward remains computationally tractable only
if the transitions considered are restricted to durations that lie within
pre-specified allowable intervals. If the true state durations lie outside those
intervals then the resulting model estimates will be incorrect: the learned
duration distributions can only reflect what is allowed given the pre-specified
duration intervals.

Figure 1: a) The Explicit Duration Hidden Markov Model. The time left in the
current state $x_t$ is denoted $d_t$. The observation at each point in time is
denoted $y_t$. b) The EDHMM with the additional auxiliary variable $u_t$ used in
the beam sampler.

Our contribution is the development of a procedure for EDHMM inference 
that does not require any hard pre-specification of duration intervals, is efficient 
in practice, and, as it is an asymptotically exact procedure, does not risk in- 
correct inference. The technique we use to do this is borrowed from sampling 
procedures developed for nonparametric Bayesian HMM variants [11]. Our key 
insight is simple: the machinery developed for inference in HMMs with a count- 
able number of states is precisely the same as that which is needed for doing 
inference in an EDHMM with duration distributions over countable support. So, 
while the EDHMM is a distinctly parametric model, the tools from nonpara- 
metric Bayesian inference can be applied such that black-box inference becomes 
possible and, in practice, efficient. 

In this work we show specifically that a "beam-sampling" approach [11]
works for estimating EDHMMs, learning both the transition structure and
duration distributions simultaneously. In demonstrating our EDHMM inference
technique we consider a synthetic system in which the state-cardinality is known
and finite, but where each state's duration distribution is unknown. We show
that the EDHMM beam sampler performs accurate tracking whilst capturing
the duration distributions as well as the probability of transitioning between
states.

The remainder of the letter is organised as follows. In Section 2 we introduce
the EDHMM; in Section 3 we review beam-sampling for the infinite Hidden
Markov Model (iHMM) [1] and show how it relates to the EDHMM inference
problem; and in Section 4 we show results from using the EDHMM to model
synthetic data.



2 Explicit Duration Hidden Markov Model 

The EDHMM captures the relationships among state $x_t$, duration $d_t$, and
observation $y_t$ over time $t$. It consists of four components: the initial
state distribution, the transition distributions, the observation distributions,
and the duration distributions.

We define the observation sequence $\mathcal{Y} = \{y_1, y_2, \ldots, y_T\}$; the
latent state sequence $\mathcal{X} = \{x_0, x_1, x_2, \ldots, x_T\}$; and the
remaining time in each segment $\mathcal{D} = \{d_1, d_2, \ldots, d_T\}$, where
$x_t \in \{1, 2, \ldots, K\}$ with $K$ the maximum number of states,
$d_t \in \{1, 2, \ldots\}$, and $y_t \in \mathbb{R}^n$. We assume that the Markov
chain on the latent states is homogeneous, i.e., that
$p(x_t = j \mid x_{t-1} = i, A) = a_{ij}\ \forall t$, where $A$ is a $K \times K$
matrix with element $a_{ij}$ at row $i$ and column $j$. The prior on $A$ is
row-wise Dirichlet with zero prior mass on self-transitions, i.e.
$p(a_{i:}) = \mathrm{Dir}(1/(K-1), \ldots, 0, \ldots, 1/(K-1))$, where $a_{i:}$ is
a row vector and the $i$th Dirichlet parameter is 0. Each state is imbued with
its own duration distribution $p(d_t \mid x_t = k) = p(d_t \mid \lambda_k)$ with
parameter $\lambda_k$. Each duration distribution parameter is drawn from a prior
$p(\lambda_k)$ which can be chosen in an application specific way. The collection
of all duration distribution parameters is
$\lambda = \{\lambda_1, \ldots, \lambda_K\}$. Each state is also imbued with an
observation generating distribution
$p(y_t \mid x_t = k) = p(y_t \mid \theta_k)$ with parameter $\theta_k$. Each
observation distribution parameter is drawn from a prior $p(\theta_k)$, also to
be chosen according to the application. The set of all observation distribution
parameters is $\theta$. In the following exposition, explicit conditional
dependencies on component distribution parameters are omitted to focus on the
particulars unique to the EDHMM.
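
As an illustration of the transition prior described above, the sketch below draws a transition matrix whose rows are Dirichlet distributed with zero mass on the diagonal. It is a minimal example using only the Python standard library; the function name is hypothetical, states are indexed from 0 rather than 1, and the Dirichlet draw is obtained by normalising independent Gamma variates.

```python
import random


def sample_transition_matrix(K, rng):
    """Draw A row-wise from Dir(1/(K-1), ..., 0, ..., 1/(K-1)) with a_ii = 0."""
    A = []
    for i in range(K):
        # Normalised Gamma(alpha, 1) draws form a Dirichlet draw; the i-th
        # component has zero concentration, so it is fixed at exactly zero.
        g = [0.0 if j == i else rng.gammavariate(1.0 / (K - 1), 1.0)
             for j in range(K)]
        total = sum(g)
        A.append([v / total for v in g])
    return A


if __name__ == "__main__":
    for row in sample_transition_matrix(3, random.Random(1)):
        print(row)
```

Fixing the diagonal at zero is what forces every state change to be a genuine change, leaving dwell times to be governed entirely by the duration distributions.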

In an EDHMM the transitions between states are only allowed at the end of 
a segment: 

$$
p(x_t \mid x_{t-1}, d_{t-1}) =
\begin{cases}
\delta(x_t, x_{t-1}) & \text{if } d_{t-1} > 1 \\
p(x_t \mid x_{t-1}) & \text{otherwise}
\end{cases}
\tag{1}
$$

where the Kronecker delta $\delta(a, b) = 1$ if $a = b$ and zero otherwise. The
duration distribution generates segment lengths at every state switch:

$$
p(d_t \mid x_t, d_{t-1}) =
\begin{cases}
\delta(d_t, d_{t-1} - 1) & \text{if } d_{t-1} > 1 \\
p(d_t \mid x_t) & \text{otherwise.}
\end{cases}
\tag{2}
$$
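
The two cases in (1) and (2) combine into a single transition kernel over state-duration pairs. The following sketch evaluates that kernel; it is an illustrative reading of the equations rather than the authors' code. The function name, the 0-based state indices, and the `dur_pmf(d, k)` callable (duration probability for state `k`) are assumptions.

```python
def transition_prob(z_t, z_prev, A, dur_pmf):
    """p(x_t, d_t | x_{t-1}, d_{t-1}) as the product of (1) and (2)."""
    x_t, d_t = z_t
    x_prev, d_prev = z_prev
    if d_prev > 1:
        # Segment continues: the state is copied and the counter decrements,
        # so both Kronecker deltas must equal one.
        return 1.0 if (x_t == x_prev and d_t == d_prev - 1) else 0.0
    # Segment ended: draw a new state from row x_prev of A and a fresh
    # duration from the new state's duration distribution.
    return A[x_prev][x_t] * dur_pmf(d_t, x_t)


if __name__ == "__main__":
    A = [[0.0, 1.0], [1.0, 0.0]]
    uniform_on_1_2 = lambda d, k: 0.5 if d in (1, 2) else 0.0
    print(transition_prob((0, 2), (0, 3), A, uniform_on_1_2))
    print(transition_prob((1, 2), (0, 1), A, uniform_on_1_2))
```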

The joint distribution of the EDHMM is

$$
p(\mathcal{X}, \mathcal{D}, \mathcal{Y}) = p(x_0)\, p(d_0)
\prod_{t=1}^{T} p(y_t \mid x_t, \theta)\,
p(x_t \mid x_{t-1}, d_{t-1}, A)\,
p(d_t \mid x_t, d_{t-1}, \lambda)
\tag{3}
$$

corresponding to the graphical model in Figure 1a. Alternative choices to define
the duration variable $d_t$ exist; see [3] for details. Algorithm 1 illustrates
the EDHMM as a generative model.



Algorithm 1 Generate Data

  sample $x_0 \sim p(x_0)$, $d_0 \sim p(d_0)$
  for $t = 1, 2, \ldots, T$ do
    if $d_{t-1} = 1$ then
      a new segment starts:
      sample $x_t \sim p(x_t \mid x_{t-1})$
      sample $d_t \sim p(d_t \mid x_t)$
    else
      the segment continues:
      $x_t = x_{t-1}$
      $d_t = d_{t-1} - 1$
    end if
    sample $y_t \sim p(y_t \mid x_t)$
  end for
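
Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the released implementation: states are 0-indexed, the duration and observation samplers are passed in as callables, and the usage example assumes a shifted Poisson duration (one plus a Poisson draw, so that durations are at least 1) and a uniform off-diagonal transition matrix, neither of which is specified in the text.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for modest rates."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def sample_categorical(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_edhmm(T, pi0, A, sample_duration, sample_obs, rng):
    """Return lists (x_1..x_T, d_1..d_T, y_1..y_T) following Algorithm 1."""
    x = sample_categorical(pi0, rng)   # x_0
    d = sample_duration(x, rng)        # d_0
    xs, ds, ys = [], [], []
    for _ in range(T):
        if d == 1:
            # A new segment starts: switch state and draw its duration.
            x = sample_categorical(A[x], rng)
            d = sample_duration(x, rng)
        else:
            # The segment continues: keep the state, count down.
            d -= 1
        xs.append(x)
        ds.append(d)
        ys.append(sample_obs(x, rng))
    return xs, ds, ys


if __name__ == "__main__":
    rates, means = [5.0, 15.0, 20.0], [-3.0, 0.0, 3.0]
    A = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]  # assumed
    xs, ds, ys = generate_edhmm(
        500, [1 / 3] * 3, A,
        lambda k, r: 1 + sample_poisson(rates[k], r),  # assumed shift
        lambda k, r: r.gauss(means[k], 1.0),
        random.Random(0))
    print(xs[:20], ds[:20])
```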



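Algorithm 2 Sample the EDHMM

  Initialise parameters $A$, $\lambda$, $\theta$. Initialise $u_t$ small $\forall t$
  for sweep $\in \{1, 2, 3, \ldots\}$ do
    Forward: run (7) to get $\alpha_t(z_t)$ given $\mathcal{U}$ and $\mathcal{Y}$ $\forall t$
    Backward: sample $z_T \sim \alpha_T(z_T)$
    for $t \in \{T, T-1, \ldots, 1\}$ do
      sample $z_{t-1} \sim \mathbb{1}(u_t < p(z_t \mid z_{t-1}))\, \alpha_{t-1}(z_{t-1})$
    end for
    Slice:
    for $t \in \{1, 2, \ldots, T\}$ do
      evaluate $l = p(d_t \mid x_t, d_{t-1})\, p(x_t \mid x_{t-1}, d_{t-1})$
      sample $u_t \sim \mathrm{Uniform}(0, l)$
    end for
    sample parameters $A$, $\lambda$, $\theta$
  end for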



3 EDHMM Inference 

Our aim is to estimate the conditional posterior distribution of the latent states
($\mathcal{X}$ and $\mathcal{D}$) and parameters ($\theta$, $\lambda$ and $A$)
given observations $\mathcal{Y}$ by samples drawn via Markov chain Monte Carlo.
Sampling $\theta$ and $A$ given $\mathcal{X}$ proceeds per usual textbook
approaches [2]. Sampling $\lambda$ given $\mathcal{D}$ is straightforward in most
situations. Indirect Gibbs sampling of $\mathcal{X}$ is possible using auxiliary
state-change indicator variables, but for reasons similar to those in [6], such a
sampler will not mix well. The main contribution of this paper is to show how to
generate posterior samples of $\mathcal{X}$ and $\mathcal{D}$.
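
One case in which sampling the duration parameters is straightforward is a Poisson duration distribution with a conjugate Gamma prior. The sketch below is an assumption-laden illustration, not the paper's procedure: it assumes unshifted Poisson durations, a Gamma prior in the shape-rate parameterisation, 0-based state indices, and hypothetical function names. Segment durations are read off as the value of the duration counter at each time step that follows a counter value of 1.

```python
import random


def segment_durations(xs, ds, K):
    """Collect, per state, the duration drawn at the start of each segment."""
    per_state = [[] for _ in range(K)]
    for i in range(1, len(ds)):
        if ds[i - 1] == 1:
            # The previous segment ended, so ds[i] is a fresh duration draw.
            per_state[xs[i]].append(ds[i])
    return per_state


def sample_poisson_rate(durations, shape0, rate0, rng):
    """Draw from Gamma(shape0 + sum(d), rate0 + n), the conjugate posterior."""
    shape = shape0 + sum(durations)
    rate = rate0 + len(durations)
    # random.gammavariate takes (shape, scale), so pass scale = 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)


if __name__ == "__main__":
    xs = [0, 0, 1, 1, 1, 2]
    ds = [2, 1, 3, 2, 1, 1]
    per_state = segment_durations(xs, ds, 3)
    rng = random.Random(0)
    print([sample_poisson_rate(d, 1.0, 0.001, rng) for d in per_state])
```

With few observed segments the posterior over each rate stays broad, so individual draws can vary widely from sweep to sweep.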



3.1 Forward Filtering, Backward Sampling 

We can, in theory, use the forward messages from the forward backward
algorithm [2] to sample the conditional posterior distribution of $\mathcal{X}$
and $\mathcal{D}$. To do this we treat each state-duration tuple as a single
random variable (introducing the notation $z_t = \{x_t, d_t\}$). Doing so
recovers the standard hidden Markov model structure and hence standard forward
messages can be used directly. A forward filtering, backward sampler for
$\mathcal{Z} = \{z_1, \ldots, z_T\}$ conditioned on all other random variables
requires the classical forward messages:

$$
\alpha_t(z_t) = \sum_{z_{t-1}} p(z_t \mid z_{t-1})\, p(y_t \mid z_t)\, \alpha_{t-1}(z_{t-1})
\tag{4}
$$

where the transition probability can be factorised according to our modelling
assumptions:

$$
p(z_t \mid z_{t-1}) = p(x_t \mid x_{t-1}, d_{t-1})\, p(d_t \mid d_{t-1}, x_t).
\tag{5}
$$



Unfortunately the sum in (4) has at worst an infinite number of terms in the
case of duration distributions with countably infinite support and at best a very
large number of terms in the case of long sequences. The standard approach to
EDHMM inference involves truncating considered durations to only those that
lie between $d_{\min}$ and $d_{\max}$, or computation involving all possible
durations up to the observed length of the sequence ($d_{\min} = 0$,
$d_{\max} = T$). This leads to per-sample, forward-backward computational
complexity of $O(T(K(d_{\max} - d_{\min}))^2)$. Truncation yields inference that
will simply fail if an actual duration lies outside hard-coded allowable duration
intervals. Considering all possible durations up to length $T$ is often
computationally impossible. The beam sampler we propose behaves like a dynamic
version of the truncation approach, automatically defining and scaling per-state
duration truncation intervals. Better still, the way it does this results in an
asymptotically exact sample with no risk of incorrect inference resulting from
incorrectly pre-specified duration truncations. We do not characterize the
computational complexity of the proposed beam sampler in this work but note that
it is upper bounded by $O(T(KT)^2)$ (i.e., the beam sampler admits durations of
length equal to the entire sequence) but in practice is found to be as efficient
as, or more efficient than, the risky hard-truncation approach.
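
As a rough worked example of these complexity expressions, the snippet below counts the transitions considered per time step for a given truncation window. The helper names are hypothetical; the values K = 3 and T = 500 match the synthetic experiment in Section 4, while the narrower window is chosen arbitrarily for comparison.

```python
def transitions_per_step(K, d_min, d_max):
    """(K (d_max - d_min))^2: pairs of joint state-duration values per step."""
    n_joint_states = K * (d_max - d_min)
    return n_joint_states ** 2


def forward_backward_cost(T, K, d_min, d_max):
    """T (K (d_max - d_min))^2, the per-sample cost up to a constant."""
    return T * transitions_per_step(K, d_min, d_max)


if __name__ == "__main__":
    K, T = 3, 500
    print(transitions_per_step(K, 0, T))    # no truncation
    print(transitions_per_step(K, 1, 50))   # arbitrary narrow window
    print(forward_backward_cost(T, K, 0, T))
```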

3.2 EDHMM Beam Sampling 

A recent contribution to inference in the infinite Hidden Markov Model (iHMM)
[1] suggests a way around truncation [11]. The iHMM is an HMM with a
countable number of states. Computing the forward message for a forward
filtering, backward sampler for the latent states in an iHMM also requires a
sum over a countable number of elements. The "beam sampling" approach [11],
which we can apply largely without modification, is to truncate this sum by
introducing a "slice" [7] auxiliary variable
$\mathcal{U} = \{u_1, u_2, \ldots, u_T\}$ at each time step. The auxiliary
variables are chosen in such a way as to automatically limit each sum in the
forward pass to a finite number of terms while still allowing all possible
durations.

The particular choice of auxiliary variable $u_t$ is important. We follow [11] in
choosing $u_t$ to be conditionally distributed given the current and previous
state and duration in the following way (see the graphical model in Figure 1b):

$$
p(u_t \mid z_t, z_{t-1}) = \frac{\mathbb{1}(0 < u_t < p(z_t \mid z_{t-1}))}{p(z_t \mid z_{t-1})}
\tag{6}
$$

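where $\mathbb{1}(\cdot)$ returns one if its operand is true and zero otherwise.
Given $\mathcal{U}$ it is possible to sample the state $\mathcal{X}$ and duration
$\mathcal{D}$ conditional posterior. Using notation
$\mathcal{Y}_{t_1}^{t_2} = \{y_{t_1}, y_{t_1+1}, \ldots, y_{t_2}\}$ to indicate
sub-ranges of a sequence, the new forward messages we compute are:

$$
\begin{aligned}
\alpha_t(z_t) &= p(z_t, \mathcal{Y}_1^t, \mathcal{U}_1^t) \\
&= \sum_{z_{t-1}} p(u_t \mid z_t, z_{t-1})\, p(z_t, z_{t-1}, \mathcal{Y}_1^t, \mathcal{U}_1^{t-1}) \\
&= \sum_{z_{t-1}} \mathbb{1}(0 < u_t < p(z_t \mid z_{t-1}))\, p(y_t \mid z_t)\, \alpha_{t-1}(z_{t-1}).
\end{aligned}
\tag{7}
$$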

The indicator function $\mathbb{1}$ results in non-zero probabilities in the
forward message for only those states $z_t$ whose likelihood given $z_{t-1}$ is
greater than $u_t$. The beam sampler derives its computational advantage from the
fact that the set of $z_t$'s for which this is true is typically small.
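
The sparsity induced by the indicator can be seen in a small sketch of the forward pass. This is an illustrative implementation under several assumptions, not the authors' code: durations are unshifted Poisson, states are 0-indexed, each message is stored as a dictionary keyed by `(x, d)`, messages are renormalised at every step for numerical stability, and all function names are hypothetical. It also assumes each slice value is small enough that at least one transition survives, which holds in the full sampler because each slice is drawn below the probability of the current path's transition.

```python
import math


def poisson_pmf(d, lam):
    return math.exp(d * math.log(lam) - lam - math.lgamma(d + 1))


def durations_above(lam, threshold):
    """All d >= 1 with Poisson(d; lam) > threshold (a finite set)."""
    out, d = [], 1
    # Past the mode the pmf only decreases, so the scan can stop as soon as
    # it falls to the threshold or below.
    while d <= lam or poisson_pmf(d, lam) > threshold:
        if poisson_pmf(d, lam) > threshold:
            out.append(d)
        d += 1
    return out


def beam_forward(ys, us, alpha0, A, lams, obs_lik):
    """Sparse forward messages: one dict {(x, d): weight} per time step."""
    alphas, prev = [], alpha0
    for y, u in zip(ys, us):
        cur = {}
        for (x_prev, d_prev), w in prev.items():
            if d_prev > 1:
                # Deterministic continuation has probability 1 > u.
                successors = [(x_prev, d_prev - 1)]
            else:
                # Keep only (x, d) with A[x_prev][x] * Poisson(d) > u.
                successors = [(x, d)
                              for x, a in enumerate(A[x_prev]) if a > u
                              for d in durations_above(lams[x], u / a)]
            for z in successors:
                # The indicator replaces the transition probability itself.
                cur[z] = cur.get(z, 0.0) + obs_lik(y, z[0]) * w
        total = sum(cur.values())
        prev = {z: v / total for z, v in cur.items()}
        alphas.append(prev)
    return alphas
```

Only the joint states reachable through a transition whose probability exceeds the slice are ever instantiated, so the size of each dictionary adapts to the slice values rather than to a fixed duration window.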

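The backwards sampling step recursively samples a state sequence from the
distribution $p(z_{t-1} \mid z_t, \mathcal{Y}, \mathcal{U})$, which can be
expressed in terms of the forward variable:

$$
\begin{aligned}
p(z_{t-1} \mid z_t, \mathcal{Y}, \mathcal{U}) &\propto p(z_t, z_{t-1}, \mathcal{Y}, \mathcal{U}) \\
&\propto p(u_t \mid z_t, z_{t-1})\, p(z_t \mid z_{t-1})\, \alpha_{t-1}(z_{t-1}) \\
&\propto \mathbb{1}(0 < u_t < p(z_t \mid z_{t-1}))\, \alpha_{t-1}(z_{t-1}).
\end{aligned}
\tag{8}
$$

The full EDHMM beam sampler is given in Algorithm 2, which makes use of
the forward recursion in (7), the slice sampler in (6), and the backwards sampler
in (8).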
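
The backward pass in (8) and the slice update in (6) can be sketched as follows. This is a minimal illustration with hypothetical names: `alphas` is a list of dictionaries mapping `(x, d)` to forward weights (index 0 holding the message for the first time step), `trans_prob(z_t, z_prev)` evaluates the joint transition probability, and the initial pair `z0` is treated as known.

```python
import random


def backward_sample(alphas, us, trans_prob, rng):
    """Draw z_1..z_T given forward messages and slice variables."""
    T = len(alphas)
    zs = [None] * T
    states, weights = zip(*alphas[-1].items())
    zs[-1] = rng.choices(states, weights=weights, k=1)[0]
    for t in range(T - 1, 0, -1):
        # Keep only predecessors whose transition into zs[t] clears the
        # slice us[t]; weight the survivors by their forward message.
        candidates = [(z, w) for z, w in alphas[t - 1].items()
                      if trans_prob(zs[t], z) > us[t]]
        states, weights = zip(*candidates)
        zs[t - 1] = rng.choices(states, weights=weights, k=1)[0]
    return zs


def resample_slices(zs, z0, trans_prob, rng):
    """u_t ~ Uniform(0, p(z_t | z_{t-1})) along the sampled path."""
    us, prev = [], z0
    for z in zs:
        us.append(rng.uniform(0.0, trans_prob(z, prev)))
        prev = z
    return us
```

Because each new slice is drawn below the probability of the transition actually taken, the current path always survives the indicator in the next forward pass, so the forward messages never become empty.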

3.3 Related Work 

The need to accommodate explicit state duration distributions in HMMs has
long been recognised. Rabiner [9] details the basic approach, which expands
the state space to include dwell time before applying a slightly modified
Baum-Welch algorithm. This approach specifies a maximum state duration,
limiting practical application to cases with short sequences and dwell times.
This approach, generalised under the name "segmental hidden Markov models",
includes more general transitions than those Rabiner considered, allowing the
next state and duration to be conditioned on the previous state and duration
[5]. Efficient approximate inference procedures were developed in the context
of speech recognition [8] and speech synthesis [14], and evolved into symmetric
approaches suitable for practical implementation [13]. Recently, a "sticky"
variant of the hierarchical Dirichlet process HMM (HDP-HMM) has been developed
[4]. The HDP-HMM has countable state-cardinality [10], allowing estimation of the
number of states in the HMM; the sticky aspect addresses long dwell times by
introducing a parameter in the prior that favours self-transition.

Figure 2: Example a) state and b) observation sequence generated by the explicit
duration HMM. Here $K = 3$; $p(y_t \mid x_t = j) = N(\mu_j, 1)$ with
$\mu_1 = -3$, $\mu_2 = 0$, and $\mu_3 = 3$; and
$p(d_t \mid x_t = j) = \mathrm{Poisson}(\lambda_j)$ with $\lambda_1 = 5$,
$\lambda_2 = 15$, and $\lambda_3 = 20$.



4 Experiments 

4.1 Synthetic Data 

The first experiment uses 500 data points (Figure 2) generated from a three
state EDHMM. The duration distributions were Poisson with rates $\lambda_1 = 5$,
$\lambda_2 = 15$, $\lambda_3 = 20$; each observation distribution was Gaussian
with means of $\mu_1 = -3$, $\mu_2 = 0$, and $\mu_3 = 3$, each with a variance
of 1. The transition distributions $A$ were set to

Broad, uninformative priors were chosen for the parameters of the duration
and observation distributions. The observation distribution parameters
were given a normal-inverse-Wishart (N-IW) prior with parameters $\nu_0 = 2$,
$\Lambda_0 = 1$, $\kappa = 0.1$ and $\mu_0 = 0$. The rate parameters for all
states were given Gamma (1,10^) priors.

One thousand samples were collected from the EDHMM beam sampler after
a burn-in of 500 samples. The learned posterior distribution of the state duration
parameters and means of the observation distributions are shown in Figure 3.
The EDHMM achieves high accuracy in the estimated posterior distribution of
the observation means, despite the overlap in observation distributions. The
rate parameter distributions are reasonably estimated given the small number
of observed segments. Figure 4 shows the mean number of transitions visited
per time point over each iteration of the sampler.

A second experiment was performed to demonstrate the ability of the EDHMM
to distinguish between states having differing duration distributions but the
same observation distribution. The same model and sampling procedure was
used as above except here $\mu_1 = 0$, $\mu_2 = 0$, and $\mu_3 = 3$. Figure 5
shows that the sampler clearly separates the high state associated with $\mu_3$
from the other states and clearly reveals the presence of two low states with
differing duration distributions. Figure 5b shows posterior samples that
indicate that the model is mixing over ambiguities between the two low states,
as it should.

Figure 3: Samples from the posterior distributions of a) the observation
distribution means and b) the duration distribution rate parameters for the data
shown in Figure 2.

Figure 4: Mean number of transitions considered per time point by the beam
sampler for 1000 post-burn-in sweeps on data from Figure 3. Consider this in
comparison to the $(KT)^2 = O(10^6)$ per time point transitions that would need
to be considered by standard forward backward without truncation, a surely-safe,
truncation-free, but computationally impractical alternative.

Figure 5: Beam sampler results from a system with identical observation
distributions but differing durations. Observations are shown in a); true states
in b) overlaid with 20 state traces produced by the sampler. Here we have
parameters $\mu_1 = \mu_2 = 0$, $\mu_3 = 3$ and $\lambda_1 = 5$,
$\lambda_2 = 15$, $\lambda_3 = 20$. Samples from the posterior observation-mean
and duration-rate distributions are shown in c) and d), respectively.



5 Discussion 

We presented a beam sampler for the explicit state duration HMM. This sampler
draws state sequences from the true posterior distribution without any need to
make truncation approximations. It remains future work to combine the explicit
state duration HMM and the iHMM. Python code associated with the EDHMM
is available online at http://github.com/mikedewar/EDHMM.